A worked basic example of a survival analysis in R

Author

Anshu Uppal

Published

May 14, 2025

Code
/* set DT table fontsizes */
th { font-size: 11px; } /* header font */
td { font-size: 11px; } /* cell font */

Background

I used the rotterdam dataset from the survival package, which includes 2982 primary breast cancers patients whose records were included in the Rotterdam tumor bank.

This dataset records two events: disease relapse and death. Follow-up time is provided for each, with the origin being the time of the primary surgery.

In the article from which this data stems, the authors restricted their dataset to the 1546 patients who had node-positive disease (“Since the validation dataset comprised only node-positive patients and nodal status is an important prognostic factor, we omitted the node-negative patients to create the derivation dataset”). I decided to also restrict the dataset to only node-positive patients.

See:
Royston, P., Altman, D.G. External validation of a Cox prognostic model: principles and methods. BMC Med Res Methodol 13, 33 (2013). https://doi.org/10.1186/1471-2288-13-33

Prognostic factors

Quoted paragraph from the article:
“Candidate prognostic variables in the breast cancer datasets were age at primary surgery (age, years), menopausal status (meno, 0 = premenopausal, 1 = postmenopausal), tumour size (size), tumour grade (grade), number of positive lymph nodes (nodes), progesterone receptors (pgr, fmol/l), oestrogen receptors (er, fmol/l), hormonal treatment (hormon, 0 = no, 1 = yes), and chemotherapy (chemo). Tumour size (mm) was not available as a continuous variable in the Rotterdam dataset, therefore a standard coding was used; the base category was ≤ 20 mm and two dummy variables were used, namely 20 to 50 mm (sized1) and > 50 mm (sized2). We excluded grade, since it was measured according to a different protocol in the two datasets, and chemo, since all patients in the validation dataset received chemotherapy.

Code
# install.packages("pacman")
pacman::p_load(
        here,
        tidyverse,
        survival,
        survminer,
        ggsurvfit,
        janitor,
        DT,
        patchwork # easily combine plots
)

# Custom package with useful function for summarising dataframe
# pak::pak("UEP-HUG/UEPtools")

# Load in the rotterdam dataset from the survival package
rotterdam <- survival::rotterdam |> 
        # filter out node-negative patients
        filter(nodes > 0) # as "nodal status is an important prognostic factor, we omitted the node-negative patients to create the derivation dataset"

# Add in the variable description
rotterdam_variables <- tribble(
        ~name, ~value,
        "pid", "patient identifier", 
        "year", "year of surgery", 
        "age", "age at surgery", 
        "meno", "menopausal status (0= premenopausal, 1= postmenopausal)", 
        "size", "tumor size, a factor with levels <=20, 20-50, >50", 
        "grade", "differentiation grade", 
        "nodes", "number of positive lymph nodes", 
        "pgr", "progesterone receptors (fmol/l)", 
        "er", "estrogen receptors (fmol/l)", 
        "hormon", "hormonal treatment (0=no, 1=yes)", 
        "chemo", "chemotherapy", 
        "rtime", "days to relapse or last follow-up", 
        "recur", "0= no relapse, 1= relapse", 
        "dtime", "days to death or last follow-up", 
        "death", "0= alive, 1= dead"
)

Inspect Rotterdam data

I went through the package’s documentation for the dataset to also include the descriptions of the variables, and used a custom function to summarise the dataset:

Code
rotterdam |> 
        UEPtools::metadata_generator_any(variable_names = rotterdam_variables) |> 
        select(-Num_Values) |> 
        DT::datatable(
                filter = "top",
                options = list(
                        pageLength = 26
                ),
                rownames = FALSE, # set to FALSE for cleaner look
                class = 'cell-border stripe hover nowrap compact'
        )

Process dataset for downstream analysis

Create new outcome variable rd, where 0= alive without relapse, 1= relapse or death.
Also create rd_time variable where I calculate days to first of relapse, death or last follow-up

Code
rotterdam_cleaned <- rotterdam |> 
        mutate(
                # Convert several variables to factor
                meno = fct_recode(factor(meno), premenopausal = "0", postmenopausal = "1"),
                grade = factor(grade),
                hormon = fct_recode(factor(hormon), No = "0", Yes = "1"),
                chemo = fct_recode(factor(chemo), No = "0", Yes = "1"),
                
                # Add categorical variable for number of nodes
                nodes_cat = factor(
                        case_when(
                                nodes >0 & nodes <4 ~ "1-3",
                                nodes >= 4 & nodes < 9 ~ "4-9",
                                nodes >=9 ~ "9+"
                        )),
                
                # `rd` where 0= alive without relapse, 1= relapse or death
                rd = case_when(recur==1|death==1 ~ 1, .default = 0),
                # Calculate days to first of relapse, death or last follow-up
                rd_time = case_when(recur==1 ~ rtime, .default = dtime),
                # Add categorical age variable
                age_cat = factor(case_when( 
                        age < 65  ~ "24-64",
                        age >= 65   ~ "65+"))
        )

# Update the rotterdam_variables object
rotterdam_variables <- bind_rows(
        rotterdam_variables,
        tribble(
                ~name, ~value,
                "rd", "0= alive without relapse, 1= relapse or death",
                "rd_time", "days to relapse/death or last follow-up",
                "age_cat", "age category at surgery")
)

Visualise the distribution of numeric variables

Code
p1 <- rotterdam_cleaned |> ggplot()+geom_histogram(aes(x=age))+labs(x="age at surgery")
p2 <- rotterdam_cleaned |> ggplot()+geom_bar(aes(x=size))+labs(x="tumor size")
p3 <- rotterdam_cleaned |> ggplot()+geom_histogram(aes(x=nodes))+labs(x="number of positive lymph nodes")
p4 <- rotterdam_cleaned |> ggplot()+geom_histogram(aes(x=pgr))+labs(x="progesterone receptors (fmol/l)")
p5 <- rotterdam_cleaned |> ggplot()+geom_histogram(aes(x=er))+labs(x="estrogen receptors (fmol/l)")
p6 <- rotterdam_cleaned |> ggplot()+geom_bar(aes(x=hormon))+labs(x="hormonal treatment")
p7 <- rotterdam_cleaned |> ggplot()+geom_bar(aes(x=meno))+labs(x="menopausal status")
p8 <- rotterdam_cleaned |> ggplot()+geom_histogram(aes(x=rd_time))+labs(x="days to relapse/death or last follow-up")

p1+p2+p3+p4+p5+p6+p7+p8+ plot_layout(ncol = 2)

Analysis

Kaplan-Meier curve

Relapse-free survival probability at different timepoints:

Code
# Fit a Surv-type object for right-censored data
rotterdam_surv_fit_rd <-  
        # survival::survfit(Surv(rd_time, rd) ~ 1, data = rotterdam_cleaned)
        # Using survfit2 from the ggsurvfit package allows easier control for 
        # downstream analysis using the ggsurvfit() function:
        ggsurvfit::survfit2(Surv(rd_time, rd) ~ 1, data = rotterdam_cleaned)

# Print its summary at specific times
summary(rotterdam_surv_fit_rd, times = seq(1000,7000,1000))
Call: survfit(formula = Surv(rd_time, rd) ~ 1, data = rotterdam_cleaned)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
 1000    928     608    0.606  0.0124       0.5819        0.631
 2000    573     289    0.413  0.0127       0.3889        0.439
 3000    321     105    0.327  0.0125       0.3033        0.353
 4000    112      62    0.243  0.0133       0.2183        0.271
 5000     24      14    0.186  0.0180       0.1538        0.225
 6000      3       2    0.144  0.0316       0.0936        0.221
 7000      1       0    0.144  0.0316       0.0936        0.221

Mean survival time:

Code
print(rotterdam_surv_fit_rd, print.rmean = TRUE)
Call: survfit(formula = Surv(rd_time, rd) ~ 1, data = rotterdam_cleaned)

        n events rmean* se(rmean) median 0.95LCL 0.95UCL
[1,] 1546   1080   2485        79   1441    1301    1583
    * restricted mean with upper limit =  7027 

Overall plot

Code
rotterdam_surv_fit_rd |> 
        ggsurvfit(color = "tomato")+
        add_censor_mark(color = "tomato", size = 1.5, alpha = 0.8) +
        add_confidence_interval(fill = "tomato") +
        add_risktable(risktable_stats = c("n.risk", "cum.censor", "cum.event")) +
        add_quantile(y_value = 0.5, color = "gray60", linewidth = 0.5) +
        scale_ggsurvfit()+
        labs(title = "Relapse-free survival after surgery")

Group comparison: menopausal status (example)

Survival probabilities:

Code
rotterdam_surv_fit_meno <-  ggsurvfit::survfit2(Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned)

summary(rotterdam_surv_fit_meno, times = seq(1000,7000,1000))
Call: survfit(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned)

                meno=premenopausal 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
 1000    404     222    0.646  0.0191        0.610        0.684
 2000    268     108    0.469  0.0201        0.432        0.511
 3000    159      37    0.395  0.0203        0.357        0.437
 4000     61      25    0.316  0.0218        0.276        0.362
 5000     17       3    0.280  0.0281        0.230        0.341
 6000      2       1    0.257  0.0341        0.198        0.333
 7000      1       0    0.257  0.0341        0.198        0.333

                meno=postmenopausal 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
 1000    524     386   0.5783  0.0163       0.5472        0.611
 2000    305     181   0.3740  0.0162       0.3436        0.407
 3000    162      68   0.2799  0.0157       0.2508        0.312
 4000     51      37   0.1929  0.0164       0.1634        0.228
 5000      7      11   0.1191  0.0226       0.0821        0.173
 6000      1       1   0.0596  0.0436       0.0142        0.250

Mean survival time:

Code
# Print mean survival time
print(rotterdam_surv_fit_meno, print.rmean = TRUE)
Call: survfit(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned)

                      n events rmean* se(rmean) median 0.95LCL 0.95UCL
meno=premenopausal  628    396   2991     125.8   1749    1510    2141
meno=postmenopausal 918    684   2130      93.6   1275    1172    1442
    * restricted mean with upper limit =  7027 

KM curve:

Code
rotterdam_surv_fit_meno |> 
        ggsurvfit()+
        add_censor_mark(size = 1.5, alpha = 0.8) +
        add_confidence_interval() +
        add_risktable(risktable_stats = c("n.risk", "cum.censor", "cum.event")) +
        add_quantile(y_value = 0.5, color = "gray50", linewidth = 0.5) +
        scale_ggsurvfit()+
        add_pvalue()+
        labs(title = "Relapse-free survival after surgery")

Log-rank test

Use a log-rank test, which accounts for the whole follow-up period. By comparing the observed number of events in each group with the number that would be expected if event rates were the same, a chi-squared test is used to determine if any observed differences are statistically meaningful.

Code
survdiff(Surv(rd_time, rd) ~ meno, rho = 0, data = rotterdam_cleaned) 
Call:
survdiff(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned, 
    rho = 0)

                      N Observed Expected (O-E)^2/E (O-E)^2/V
meno=premenopausal  628      396      479      14.4        26
meno=postmenopausal 918      684      601      11.5        26

 Chisq= 26  on 1 degrees of freedom, p= 3e-07 

Regression modelling (CoxPH)

Visually inspect individual KM curves for covariates

Code
rotterdam_surv_fit_age_cat <-  survfit2(Surv(rd_time, rd) ~ age_cat, data = rotterdam_cleaned)
rotterdam_surv_fit_age_cat |> 
        ggsurvfit()+
        add_censor_mark(size = 1.5, alpha = 0.8) +
        add_confidence_interval() +
        add_risktable(risktable_stats = c("n.risk", "cum.censor", "cum.event")) +
        add_quantile(y_value = 0.5, color = "gray50", linewidth = 0.5) +
        scale_ggsurvfit()+
        add_pvalue()+
        labs(title = "Relapse-free survival after surgery")

Code
rotterdam_surv_fit_size <-  survfit2(Surv(rd_time, rd) ~ size, data = rotterdam_cleaned)
rotterdam_surv_fit_size |> 
        ggsurvfit()+
        add_censor_mark(size = 1.5, alpha = 0.8) +
        add_confidence_interval() +
        add_risktable(risktable_stats = c("n.risk", "cum.censor", "cum.event")) +
        add_quantile(y_value = 0.5, color = "gray50", linewidth = 0.5) +
        scale_ggsurvfit()+
        add_pvalue()+
        labs(title = "Relapse-free survival after surgery")

Code
rotterdam_surv_fit_hormon <-  survfit2(Surv(rd_time, rd) ~ hormon, data = rotterdam_cleaned)
rotterdam_surv_fit_hormon |> 
        ggsurvfit()+
        add_censor_mark(size = 1.5, alpha = 0.8) +
        add_confidence_interval() +
        add_risktable(risktable_stats = c("n.risk", "cum.censor", "cum.event")) +
        add_quantile(y_value = 0.5, color = "gray50", linewidth = 0.5) +
        scale_ggsurvfit()+
        add_pvalue()+
        labs(title = "Relapse-free survival after surgery")

Code
rotterdam_surv_fit_nodes_cat <-  survfit2(Surv(rd_time, rd) ~ nodes_cat, data = rotterdam_cleaned)
rotterdam_surv_fit_nodes_cat |> 
        ggsurvfit()+
        add_censor_mark(size = 1.5, alpha = 0.8) +
        add_confidence_interval() +
        add_risktable(risktable_stats = c("n.risk", "cum.censor", "cum.event")) +
        add_quantile(y_value = 0.5, color = "gray50", linewidth = 0.5) +
        scale_ggsurvfit()+
        add_pvalue()+
        labs(title = "Relapse-free survival after surgery")

Cox model specification

This is the full model using the variables specified in the original article (see section “Prognostic factors” above)

Code
rotterdam_cox <- 
        coxph(Surv(rd_time, rd) ~ age + meno + size + nodes + pgr + er + hormon,
              data = rotterdam_cleaned
        )
summary(rotterdam_cox)
Call:
coxph(formula = Surv(rd_time, rd) ~ age + meno + size + nodes + 
    pgr + er + hormon, data = rotterdam_cleaned)

  n= 1546, number of events= 1080 

                         coef  exp(coef)   se(coef)      z Pr(>|z|)    
age                 0.0045330  1.0045433  0.0039914  1.136  0.25608    
menopostmenopausal  0.2418071  1.2735485  0.1065947  2.268  0.02330 *  
size20-50           0.2855600  1.3305069  0.0738251  3.868  0.00011 ***
size>50             0.6124397  1.8449269  0.0941898  6.502 7.92e-11 ***
nodes               0.0558619  1.0574516  0.0052392 10.662  < 2e-16 ***
pgr                -0.0002002  0.9997998  0.0001237 -1.618  0.10560    
er                 -0.0002603  0.9997398  0.0001250 -2.082  0.03733 *  
hormonYes          -0.3228457  0.7240856  0.0809431 -3.989 6.65e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                   exp(coef) exp(-coef) lower .95 upper .95
age                   1.0045     0.9955    0.9967    1.0124
menopostmenopausal    1.2735     0.7852    1.0334    1.5695
size20-50             1.3305     0.7516    1.1513    1.5376
size>50               1.8449     0.5420    1.5339    2.2190
nodes                 1.0575     0.9457    1.0466    1.0684
pgr                   0.9998     1.0002    0.9996    1.0000
er                    0.9997     1.0003    0.9995    1.0000
hormonYes             0.7241     1.3811    0.6179    0.8486

Concordance= 0.66  (se = 0.008 )
Likelihood ratio test= 235.5  on 8 df,   p=<2e-16
Wald test            = 266.1  on 8 df,   p=<2e-16
Score (logrank) test = 277.2  on 8 df,   p=<2e-16

Forest plot

Code
survminer::ggforest(rotterdam_cox, data = rotterdam_cleaned)

Assumptions: Proportional hazards

Schoenfeld residuals test:
  • For each covariate and globally, it tests the null hypothesis that the PH assumption holds (i.e., the slope of Schoenfeld residuals vs. time is zero).
  • A small p-value (e.g. < 0.05) for a covariate suggests that the PH assumption is violated for that variable.
  • The ‘GLOBAL’ test indicates if the assumption is violated for the overall model.
Code
rotterdam_test <- cox.zph(rotterdam_cox)
rotterdam_test
          chisq df       p
age     4.05333  1 0.04408
meno    0.00248  1 0.96026
size   13.17300  2 0.00138
nodes   6.73579  1 0.00945
pgr    15.02219  1 0.00011
er     24.55857  1 7.2e-07
hormon  2.99857  1 0.08334
GLOBAL 61.45472  8 2.4e-10

The ‘GLOBAL’ test appears to show that the PH assumption is violated for the overall model, and also appears to be violated for the age, size, nodes, and er variables.

Visual inspection of the test:
  • Plots the scaled Schoenfeld residuals against transformed time for each covariate.
  • A non-horizontal line with a non-zero slope suggests a violation of the PH assumption.
Code
survminer::ggcoxzph(rotterdam_test)

Taking into account the results of the test of the proportional hazards assumption, and also after visually inspecting the Schoenfeld residuals, it appears that the validity of the proportional hazards assumption is in doubt. The next steps may involve some combination of transformation of variables and/or consideration of including time-dependent interaction terms.